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Abstract Sparse principal component analysis addresses the problem of find¬ 
ing a linear combination of the variables in a given data set with a sparse 
coefficients vector that maximizes the variability of the data. This model en¬ 
hances the ability to interpret the principal components, and is applicable in 
a wide variety of fields including genetics and finance, just to name a few. 

We suggest a necessary coordinate-wise-based optimality condition, and 
show its superiority over the stationarity-based condition that is commonly 
used in the literature, and which is the basis for many of the algorithms de¬ 
signed to solve the problem. We devise algorithms that are based on the new 
optimality condition, and provide numerical experiments that support our 
assertion that algorithms, which are guaranteed to converge to stronger op¬ 
timality conditions, perform better than algorithms that converge to points 
satisfying weaker optimality conditions. 

Keywords optimality conditions • principal component analysis • sparsity 
constrained problems • stationarity • numerical methods 


1 Introduction 

Principal component analysis (PCA) is a well known data-analytic technique 
that linearly transforms a given set of data to some equivalent representa¬ 
tion. This transformation is defined in such a manner that any variable in the 
new representation, called a principal component (PC), expresses most of the 
variance in the data, which is not expressed by the PCs that precede it. The 
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linear combination defining each of the PCs is given by a coefficients (also 
termed loadings) vector. In terms of the covariance (or correlation) matrix of 
the data, the coefficients vector of the k-th PC is the eigenvector that corre¬ 
sponds to the /c-th largest eigenvalue [1]. One major drawback of PCA is that 
commonly the coefficients vectors are dense, i.e., each PC is a linear combina¬ 
tion of much, if not most, of the original variables, which causes a difficulty in 
interpreting the obtained PCs. This disadvantage encouraged a wide interest 
in the sparsity constrained version of PCA, which imposes an additional con¬ 
straint, enforcing the coefficients vector not to exceed some predetermined 
sparsity level s. 

Enforcing sparsity on the coefficients vector is commonly acceptable in 
some applications. For example, in the exploration of micro-array gene ex¬ 
pression patterns, PCA is employed in order to classify different tissues ac¬ 
cording to their gene expression. It is also desirable that such discrimination 
can be executed by utilizing only a small subset of the genes, thus encourag¬ 
ing sparse solutions [2]. The desire to obtain interpretable coefficients vectors 
is not the only reason to favor the sparse PCA model. For example, some fi¬ 
nancial applications will prefer sparse solutions in order to reduce transaction 
costs [ 3 ]. Clearly, incorporating an additional sparsity constraint will provide 
a PC that, generally, does not explain all of the variance which is explained 
by the regular PC; nevertheless, in such applications, this sacrifice is accept¬ 
able with respect to the obtained benefits. We refer to this formulation as the 
sparsity constrained formulation, and it is merely one of several alternative 
formulations considered in the literature. The common alternatives are the re¬ 
sult of treating the sparsity term, or its relaxation, by a penalty approach. The 
sparse PCA problem is a difficult non-convex problem, and can be optimally 
solved only for small scale problems by performing exhaustive or a branch 
and bound search over all possible support sets [4]. Thus, in order to handle 
large scale problems, the algorithms proposed in the literature are seeking to 
find an approximate solution. One of the first methods, suggested by Cadima 
and Jolliffe [ >], is to threshold the smallest, in absolute value, elements of 
the dominant eigenvector. Unfortunately, this remarkably simple approach is 
known to frequently provide poor results. In [ 4 ] Moghaddam et al. proposed 
several greedy methods. An advantage of these methods is that they pro¬ 
duce a full path of solutions (i.e., a solution for each of the values of sparsity 
level up to s), but the necessity to perform a large amount of eigenvalue com¬ 
putations at each step render them quite computationally expensive. In [6], 
d'Aspremont et al. proposed an approximate greedy approach that obviates 
the necessity to perform most of the eigenvalue computations by evaluating 
a lower bound on the eigenvalues, which results in a substantial reduction 
of computation time. Another approach presented by d'Aspremont et al. in 
[ 6 ], and earlier in [ 7 ], is to consider a semidefinite programming formulation 
with a rank constraint for some of the relaxed and / or penalized models of 
PCA. These equivalent formulations are still hard non-convex problems, and 
thus a relaxed model is solved and an approximate solution is derived for the 
original problem. The algorithms used to solve the SDP relaxations are not 
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applicable for large scale problems, rendering this approach as non-scalable. 
In [ 8 ], encouraged by the LASSO approach suggested for regression [ 9 ], Jol- 
liffe et al. proposed the absolute value norm constrained formulation under 
the name SCoTLass (simplified component technique LASSO), which is a re¬ 
laxation of the sparsity constrained problem. In practice, the numerical study 
was conduced on the penalized version by implementing the projected gradi¬ 
ent algorithm. The relaxed model was further considered in the literature. An 
alternating minimization scheme to solve the constrained formulation was 
proposed in [ 10 ]. Another work that addressed the constrained formulation 
was motivated by the expectation maximization algorithm for probabilistic 
PCA [ 11 ]. Even though the work addressed the constrained formulation, the 
sequence generated by the method in [ 1 ] is guaranteed to be s-sparse. Penal¬ 
ized versions were also considered extensively. In [ 12 ] Zou et al. formulated 
the sparse PCA as a regression-type model, where the f-th principal compo¬ 
nent was approximated by the linear combination of the original variables. 
A LASSO and ridge penalties are imposed on the coefficients vector forming 
the elastic net model that generalizes the LASSO [ 3 ] and an alternating min¬ 
imization algorithm, called SPCA, was proposed. In [ 14 ] Shen and Huang 
proposed several iterative schemes to solve the penalized versions via regu¬ 
larized SVD. These methods were considered further in [ 15 ], where a gradient 
scheme was proposed and a convergence analysis, that was missing in [ 14 ], 
was also provided. 

Recently, Luss and Teboulle showed in [ 16 ] that the seemingly different 
methods proposed in [ 15 , 14 , 11 , 17 , 10 , 12 ] are some particular realizations of 
the conditional gradient algorithm with unit step-size. The work [ 16 ] pro¬ 
posed a unified algorithmic framework which they refer to as ConGradU 
(the well known conditional gradient with unit step size) and established 
convergence results, showing that the algorithm produces a point satisfying 
some necessary first order optimality criteria. Some novel schemes are pro¬ 
vided. One of them addresses directly the sparsity constrained formulation 
of sparse PCA. 

As already mentioned, none of the methods listed above can guarantee to 
produce an optimal solution. In addition, the sparse PCA problem does not 
seem to posses a verifiable necessary and sufficient global optimality condi¬ 
tion, and hence, in general, there is no efficient way to check if a given vector 
is the global optimal solution 1 . Therefore, the comparison of the methods in 
the literature is based solely on numerical experiments without providing 
any theoretical justification for the advantage of a certain method over the 
others. However, most of the algorithms just listed will produce a solution 
that satisfies some necessary optimality condition. In a recent work. Beck and 
Eldar [ 18 ] employed some of the aforesaid conditions in order to provide 
an insight regarding the success of the corresponding algorithms. Under the 
framework of minimizing a continuously differentiable function subject to a 
sparsity constraint, several necessary optimality conditions were presented. 


1 In [h] the authors suggested a sufficient optimality condition. 
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The relations between the different optimality conditions were established, 
showing that some of the conditions are stronger (that is, more restrictive) 
than others. An extension to problems over sparse symmetric sets was con¬ 
sidered in [19]. In this paper, we adopt this methodology in order to establish 
a hierarchy between two necessary optimality conditions for the sparsity con¬ 
strained sparse PCA problem. The first condition that we consider is a well 
known first order condition, that was originally presented in the context of 
the sparse PCA problem in [16]. We will refer to it as the complete (co) station- 
arity condition. Much of the existing algorithms in the literature are actually 
guaranteed to converge to a co-stationary point. The second condition, which 
we call coordinate-wise (CW) maximally, is a generalization of one of the con¬ 
ditions considered in [18], and it essentially states that the function value 
cannot be improved by making changes of at most two coordinates. 

In the following section we will explicitly define the conditions under con¬ 
sideration. In Section 3, we will establish the relation between the conditions, 
showing that the CW-maximality condition is stronger (that is, more restric¬ 
tive) than co-stationarity. In Section 4, we will introduce algorithms that pro¬ 
duce points satisfying the aforementioned conditions and finally, in Section 
5, we will provide a numerical study on simulated and real life data that sup¬ 
ports our assertion that algorithms that correspond to stronger conditions are 
more likely to provide better results. 


2 Necessary Optimality Conditions 

Throughout the paper, we consider the following sparsity constrained prob¬ 
lem: 


max{/(x) :xeS], (P) 

where / is a continuously differentiable convex function over K" and 

S := {x € R" : ||x|| 2 < 1, ||x|| 0 < s}, 

with || • 11o being the so-called Z 0 - n orm defined by ||x|| 0 := \{i : Xi ^ 0} | 2 . 
As a special case, when the objective function is chosen as /(x) = x v Ax, 
where A is a given positive semidefinite matrix, problem (P) amounts to the 
l 0 -constrained sparse PCA model: 

max{x T Ax : x 6 S}. (SPCA) 

In PCA applications, A usually stands for the covariance matrix of the data. 

In this section, we will present two necessary optimality conditions for 
the general model (P). Although our main motivation is to study the sparse 
PCA problem, we will nonetheless consider the general model (P), since our 
results are also applicable in this general setting. 

2 Note that the Zo-norm is not actually a norm since it does not satisfy the absolute homogene¬ 
ity property. 
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Prior to presenting the optimality conditions, we will introduce in the fol¬ 
lowing subsection some notation and definitions that will be used in our anal¬ 
ysis. 

2.1 Notation and Definitions 

A subvector of a given vector x £ R" corresponding to a set of indices 
T C. {1.2..... n\ is denoted by xj-. Similarly, we will denote the subvector 
of the gradient V/(x) corresponding to the indices in T by Vt/(x). The sign 
of a given a £ R. is denoted by sgn(a) and is equal to 1 for a > 0 and — 1 
for a < 0. The support set of some arbitrary vector x will be denoted by 
ii(x) = {i : Xi ^ 0} and its complement by 7 0 (x) = {« : Xi = 0}. For a given 
vector x £ R n and an integer s £ {1,2,..., n — 1}, we will define M s (x) to 
be the s-th largest absolute value component in x. For such x and s, we will 
define the sets 7>(x, s), 7 = (x. s) and 7<(x, s) as follows: 


7>(x,s) := {* : \xi\ > M s (x)} 



ll x llo > S, 
ll x llo < S, 



{* : \%i I < || x llo > S, 

{i : Xi = 0}, 11 x 11 o < s. 


We will also define the set 7>(x, s) := I >(x, s) U 7 = (x, s) and the set 
7<(x, s ) := 7<(x, s)U7 = (x, s). Obviously, the sets 7>(x, s), 7 = (x, s ) and 7<(x, s) 
form a partition of {1, 2,..., n}. Furthermore, when ||x||o < s, we have that 
7>(x, s) = 7i(x),7=(x) = 0 and 7<(x, s) = 7 0 (x). 

The sets defined above posses some convenient and elementary properties 
which are given in Lemma 2.1 below. Since all the properties stated in the 
lemma are rather simple consequences of the definition of the sets 7>( x , s), 
7 = (x, s), 7<(x, s), the proof is omitted. 

Lemma 2.1 1. If x ^ 0, then 7>(x, s) 0. 

2. If |7> (x, s) | < s then Xj = 0 for all j £ 7< (x, s). 

3. For any i £ 7>(x, s),j £ 7 = (x, s) and k £ I <(x, s), it holds that 
\Xi\ > \xj\ > \x k \. 

We will frequently use the notation 



for the set containing all the subsets of indices corresponding to the nonzero 
s largest in absolute value components of a given vector x. When ||x||o < s, 
there are no more than s nonzero elements in x, and the above definition 
actually amounts to i? s (x) = {7i(x)}. However, when ||x||o > s, there might 
be more than one set of indices corresponding to the s largest absolute value 
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components of x. For example, consider the vector x = (3, 2,1,1,1, 0, 0) T and 
the sparsity level s = 3. Then, 

/fe(x) ={{1,2,3}, {1,2,4}, {1,2, 5}}. 

On the other hand, in the following examples, the set contains a single subset: 

f? 3 ((0,-5,4,-3,2,0) T ) = {{2,3,4}}, f? 3 ((0,0,4, — 3,0,0) T ) = {{3,4}}. 

The hard thresholding operator maps a vector x e E” to the set of vectors 
that are generated by keeping the s largest absolute value components of x 
and setting all the others to zeros. This operator, which we denote by H s , is 
formally defined by 

H s (x):= |J {y : y r = x T ,y f = 0}. 

reflex) 


Thus, for example. 


JT 3 ((3,2,l,l,l,0,0) r ) = 

{(3,2,1,0,0,0, Of, (3,2,0,1,0,0, Of, (3,2,0,0,1,0,0) T }. 


2.2 Complete (co) - Stationarity 

The first condition that we consider was presented for the sparse PCA prob¬ 
lem in [16]. We refer to it as the complete (co) stationarity condition. 

Definition 2.1 (co-stationarity) Let x be a feasible solution of (P). Then, x is 
called a co-stationary point of (P) over S if and only if it satisfies: 

(V/(x), v — x) < 0 Vv e S. 

This is probably the most elementary first order condition for constrained dif¬ 
ferentiable optimization problems. The work [16] provided a unified frame¬ 
work for several algorithms designed to solve different formulations of sparse 
PCA. Actually, [16] considered the co-stationarity condition over a general 
nonempty and compact set instead of S, and for this general case, the follow¬ 
ing proposition, which was originally established in [20], was recalled. This 
result follows from the convexity of the objective function. 

Proposition 2.1 Let f : R n —»• Rbe a continuous differentiable and convex func¬ 
tion over R ra , and let C be a nonempty and compact set. If x is a global maximum of 
f over C , then x is a co-stationary point over C, meaning that (V/(x) , v - x) < 0 
for any v € C. 
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2.3 CW-Maximality 

The second necessary optimality condition that we will consider is coordinate- 
wise maximality. This optimality condition is in fact a type of a local opti¬ 
mality condition, stating that a given point x is a minimizer over a neigh¬ 
bourhood consisting of all feasible points, that are different by at most two 
coordinates. We will denote the corresponding neighbourhood by 

^(x) := {z : ||z - x || 0 < 2,z € S}. 

The formal definition of a CW-maximum point follows. 

Definition 2.2 (CW-maximum point) Let x be a feasible solution of (P). Then, 
x is called a coordinate-wise (CW) maximum point of (P) if and only if 
/( x ) > /(z) for every z e ^(x). 

Obviously, CW-maximality, by its definition, is a necessary optimality condi¬ 
tion. 

Proposition 2.2 Let x be an optimal solution to (P). Then, x is an CW-maximum 
point. 

Instead of considering the neighbourhood S 2 (x) in the definition of 
CW-maximality (Definition 2.2), we could have alternatively considered 
larger neighbourhoods consisting of vectors that differ from x by at most k 
coordinates for some 2 < k < s: 

S'fc(x) := {z : ||z - x || 0 <k,z£ S}. 

A similar optimality condition over such a neighbourhood can be defined, 
and clearly since S t (x) C Si- (x) for any t < k, considering neighbourhoods 
that differ by a larger amount of coordinates will result in stronger optimality 
conditions. Note that the amount of comparisons required in order to verify 
that a vector x £ with a full support (/i(x) = s) is CW-maximal (k = 2) 
is 0(s ■ n), while changing the neighbourhood to S 3 will increase the amount 
of comparisons to 0(s • n 2 ). Hence, considering such a stronger optimality 
condition has a substantial computational price. Keeping in mind that we 
seek scalable conditions and algorithms, we restrict the discussion to the case 
k = 2 . 

3 Optimality Conditions Hierarchy 

Our main result in this section is that CW-maximality is a stronger (that is, 
more restrictive) optimality condition than co-stationarity. This result also has 
an impact on the performance of the corresponding algorithms in the sense 
that, loosely speaking, algorithms that are only guaranteed to converge to a 
co-stationary point are less likely to produce the optimal solution of the prob¬ 
lem than algorithms that are guaranteed to converge to a CW-maximal point. 
In Section 5, we will show that the numerical results support this assertion. 
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3.1 Technical Preliminaries 

We will begin by providing some auxiliary technical results that will be used 
in order to establish the main result. Lemma 3.1 is a trivial result, that follows 
directly from the Cauchy-Schwarz inequality (see also Lemma 4.1 in [16]). 

Lemma 3.1 Suppose that 0 / q e I d and p > 0. Then, the optimal solution of the 
optimization problem 

max{q T x : ||x|| 2 < p}, (QCLP) 

xGR d 

is given by x* = with the optimal value o/p||q|| 2 . 

The following simple lemma is an extension of Proposition 4.3 from [16]. 

Lemma 3.2 Assume that 0 ^ p G R". Then, the set of optimal solutions of the 
optimization problem 

max {p T x : ||x|| 0 < s, ||x|| 2 < 1}, (S-QCLP) 

xGM n 

is given by 

X *(p, s) := |: x G fT s (p)| , 

ivith the optimal value of ||pr|| 2 , where T G R s { p). 

Proof We can write (S-QCLP) as 

max max jp T x : ||x|| 2 < l,/i(x) C T) . (1) 

T C {l,...,n}x£l" 1 J 

\T\ < s 

According to Lemma 3.1, for each T C {1,..., n} satisfying \T\ < s, the opti¬ 
mal value of the inner optimization problem is ||pr|| 2 , and if py f 0, then a 
solution x* to the inner optimization problem is given by 

X ^ = ^TT’ x ? = 0 - ( 2 ) 

IIPt||2 

The problem (1) thus reduces to 

max ||pr|| 2 - (3) 

T C {1, . .. ,«} 

\T\ < s 

Obviously, when |jp||o > s, the optimal solutions of the latter problem are all 
the sets containing the indices of components corresponding to the s largest 
absolute values in p, and when ||p|| 0 < s, the unique optimal solution is 
I\ (p). Thus, the set of all optimal solutions of (3) is i? s (p). Noting that p r / 0 
for any T G R s { p), we conclude that the optimal solutions of (S-QCLP) are 
given by (2) with T being any set in R s ( p), which are exactly the members of 
X*(p,s). □ 
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Our final technical lemma states that, if a given vector x is not an optimal 
solution of the problem of maximizing a linear function over the unit norm, 
then there must be two indices i j for which the subvector ipp is also 
not an optimal solution for the problem restricted to the the variables Xi, x 3 
(while fixing all the other variables). This lemma is rather simple, but will 
play a key role in the proof of the main result. 

Lemma 3.3 Let q £ and p > 0. Suppose that x satisfies ||x|| 2 < p, and that it 
is nofan optimal solution of (QCLP). Then, there exist indices i,j(i j) such that 
xp jy is not the optimal solution of 



Proof Since x is not the optimal solution of (QCLP), we obtain that q f 0 
(since otherwise, if q = 0, all feasible points are also optimal). Thus, the set 
I\ (q) is nonempty. We will split the analysis into two cases. 

- If 11 Sc 11 2 < p, then take any i £ /i(q) and j i, and we can write 


1/2 



which together with qpp} f 0 (since i £ l \ (q)) implies that xpp} is not 
the optimal solution of (2-QCLPp pj), since we have, by Lemma 3.1, that 
the constraint at the optimal solution must be active. 

- If, on the other hand, ||x|| 2 = p, then assume in contradiction that for 
each i j the vector xppj is the optimal solution of (2-QCLPpp ). Take 
some i £ !\ (q). For any j £ /o(q), we know that xpp is the optimal 
solution of (2-QCLPp p) and thus, according to Lemma 3.1 (employed on 
the problem (2-QCLPpp)), it must in particular satisfy ap = 0, that is, 
j £ /o(x). To summarize. 


Xj = 0 for any j £ 7 0 (q). 


(4) 


Now, for any j £ I \ (q), according to Lemma 3.1, xp p must satisfy 



(5) 


where here we used the fact that p 2 — Yli^i j = x 2 + x 2 . Squaring both 




for any j £ /i(q). 
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By (4), Xj = 0 whenever j £ /o(q), and we can therefore write 


•Ej 2 5 j 1, 2, . . . , 71. 

3 Qt 

Summing over j = 1,2...., n, and using the fact that ||x||| = p 2 , it follows 
that 

n 2 

j=i 


implying that 


x 2 = n 2 -^— 

1 p llqlli ’ 


which combined with the fact that that sgn(x,) = sgn(r/ 7 ) (see (5) ), yields 

Qi 


Xi = p-. 


|q 2 


Since we actually proved the latter for an arbitrary i £ I\ (q), and since 
Xi = 0 for any i £ J 0 (q) (see (4)), it follows that 


x = p T 


|q|h 


in contradiction to the assumption that x is not an optimal solution of 
(QCLP). 

□ 


The following corollary is a direct consequence of Lemmas 3.2 and 3.3. 

Corollary 3.1 Let x £ S. If x is no± an optimal solution to (S-QCLP) and 
ii(x) C T for some T £ R s ( p), then there exist indices i,j £ T(i f j ) such 
that X{jj} is not an optimal solution of (2-QCLP^jj). 

Proof Assume that \T\ = k. Since x is not an optimal solution of (S-QCLP), 
it follows by Lemma 3.2 that xt f UpTH' which implies that xt is not the 
optimal solution of the restricted problem 

min {p£y : ||y|| 2 < p} . 


Therefore, invoking Lemma 3.3 with d = k, q = Pt, it follows that there 
exist indices i. j £ T(i / j ) such that x, , is not an optimal solution of 
(2-QCLP {ij} ). ' □ 
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3.2 Co-Stationarity vs. CW-Maximality 

The main result of this paper is given in the following theorem, which estab¬ 
lishes the superiority of the CW-maximality condition over the co-stationarity 
condition. 

Theorem 3.1 Let x be a CW-maximum point of problem (P). Then, x is a co¬ 
stationary point of (P). 

Proof Let x be a CW-maximum point of (P). Assume by contradiction that x 
is not a co-stationary point. This means that there exists a vector v £ S such 
that 


V/(x) T (v - x) > 0. (6) 

We will show that we can find a vector z £ S 2 (x) such that 

V/(x) T (z — x) > 0. (7) 

This will imply a contradiction to the CW-maximality of x by the following 
simple argument: since / is a convex function, we have 

/(z) > /(x) + V/(x) T (z - x), 

which combined with (7) implies that 

/(z) > /(x), 

which is an obvious contradiction to the CW-maximality of x. 

Since x satisfies (6), we obviously have V/(x) f 0. Let X*(V/(x), s) be 
the set of optimal solutions of (S-QCLP) with p = V/(x) and let 

x* £ X* (V/(x), s) be some particular solution. Then, 

V/(x) T x* > V/(xfv > V/(xfx, 

and thus x f X*(V/(x), s). 

Suppose that there exists some l for which V//(x) ■ Xi < 0 (and in particular 
l £ Tl(x)). Define z as: 

’ - 1 n 3 = f 

^ J (Xj, otherwise. 

z £ S‘ 2 (x) and V/(x) T (z — x) > 0 since 

V/(x) T (z - x) = -2 • V//(x) • xi > 0. 

We have thus shown in this case the desired contradiction. From now on, we 
will therefore consider the case where Vj/(x) ■ x, > 0 for all i = 1 ,,n. 
Consider the following cases: 
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1. h(x)£I>(X7f(x.),s). 

Obviously, there is some h £ 7i(x) n 7<(V/(x),s). 

We will consider the following subcases: 

1.1. If |7>(V/(x), s)| < s, then V/,/(x) = 0 (by Lemma 2.1, part 2), and 
since V/(x) / 0, we conclude, using Lemma 2.1 (part 1), that there is 
some l £ 7> (V/(x), s). Define z as: 

f sgn (V//(x)) • (x£ + x] ) 1/2 j = l, 
j = l,...,n Zj := < 0 j = h, 

[ Xj otherwise. 


Obviously z £ ^(x), and in addition V/(x) T (z — x) > 0 since 


V/(x) T (z — x) = 


V;/(x) • sgn (V z /(x)) • (x| + x^) 1/2 - V//(x) • x, 
Vz/(x) 

Vz/(x) 

Vz/(x) 


(4 + 4) 1/2 - v //( x ) • x i 


(4 + 
((4 + 


c ?) 1 / 2 - 

4) 1/2 - 


Vz/(x)| • 
|x ; |) > 0 


14 


(V//(x) • xi > 0) 
(Vz/(x)^0,x ft ^0). 


1.2. If |7>(V/(x), s)| > s, then there is some l £ 7>(V/(x),s) such that 
l £ Ji(x). Otherwise 7>(V/(x), s) C ^(x), and since |7>(V/(x), s)| > s 
and |Ji(x)| < s, we have that 7>(V/(x), s) = 7i(x), contradicting our 
assumption that /i(x) (Z 7>(V/(x), s). We will define z as: 


j = l,...,n 


sgn(V z /(x)) • |x h | j = l, 

0 j = h, 

Xj otherwise. 


Clearly, z £ .Shfx). In addition, V/(x) T (z — x) > 0 since: 

V/(x) T (z - x) = V z /(x) • sgn (V ; /(x)) • |xft| 

-VJ(x) ■ x^ 

= |V;/(x)| • |x h | - |V/j/(x)| • |x fe | (V h /(x) • x h > 0) 
= (|Vz/(x)|-|V h /(x)|)-|x h |>0, 

where the last inequality holds since Xh =/= 0 and the indices l and h 
are such that I £ 7>(V/(x), s) and h £ I < (V/(x), s), thus according to 
Lemma 2.1 (part 3) |V//(x)| > |V ft /(x)|. 

2. 7i(x) C 7>(V/(x), s) 

Now we will consider the following subcases: 

2.1. If 7i(x) C T for some T £ 7? s (V/(x)), then since x ^ X*(V/(x), s), 
it follows that according to Corollary 3.1, there exist indices h,l £ T 
such that 


x := argmax 

y£R 2 


V {M} /(x) T y : ||y| 


i^h,l 
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satisfies 


V{ ft ,i}/(x) T x > V {fti q/(x) T x {M} . (8) 

Since |T| < s and |jx||| < 1 — i x b the vec f° r 

! xi, j = ft, 

%2, j = l, 

Xj , otherwise, 

is in ^(x), and satisfies by (8) that V/(x) T (z — x) > 0. 

2.2. If ii(x) % T for all T £ R s (Vf(x)), then: 

- Take h £ 7i(x) such that h ^ T for some T £ i? s (V/(x)). Since 
/i(x) C 7>(V/(x), s), it follows that h £ 7>(V/(x), s). Moreover, 
since />(V/(x), s) C T and ft ^ T, we have that h £ I >(V/(x), s), 
implying that h £ 7 = (V/(x), s). Thus, h £ 7 = (V/(x), s ) n 7i(x). 

- 7>(V/(x),s) % 7i(x). To show this, note that otherwise, 
7>(V/(x),s) C 7i(x), and since 7i(x) C 7>(V/(x),s) and 
|7i(x)| < s, we obtain that |7x(x)| < min {s, |7>(V/(x), s)|}, im¬ 
plying that 7i(x) C T for some T £ 7?«(V/(x)), in contradiction 
to our assumption. Thus, there exists some l £ 7>(V/(x), s) such 
that l £ 7i(x). 

Define z as: 

( sgn (V ; /(x)) • j = l, 

j = 1, ... ,ri := < 0, j = ft, 

I cCj-, otherwise. 

Clearly, z G .S' 2 (x), Furthermore, V/(x) T (z — x) > 0 since 

V/(x) T (z - x) = V;/(x) • sgn (V ; /(x)) • |x h | 

-V;,/(x) • 

= |V;/(x)| • |*h| - |v„/(x)j ■ lajfel (V h /(x) -x h >0) 
= (|Vi/(x)| - |V h /(x)|) - |* A ] > 0, 

where the last inequality holds since / 0 and the indices / and ft are 
such that Z G 7>(V/(x), s) and ft G 7 = (V/(x), s), and thus according 
to Lemma 2.1 (part 3) |V//(x)| > |V^/(x)|. 

We have thus arrived at a contradiction, and the desired implication is estab¬ 
lished. □ 

In order to show that the reverse implication is not valid, that is, that co¬ 
stationary points are not necessarily CW-maximal points, we present an ex¬ 
ample of a problem instance and a co-stationary point, that is not a CW- 
maximal point. 
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Example 3.1 For any n > s > 0, we consider problem (SPCA) with a diago¬ 
nal matrix A, whose entries on the main diagonal are given by the vector a 
defined by 


a := 


{ 2 • l„_ s \ 
y 0.5 • l s y ’ 


where for a given positive integer m, l m and 0„, are the vectors of size m 
with all entries equal to ones or zeros, respectively We also define 


0 

s -°' 5 ■ ls-lj 

It easy to see that x, x e S and that A A 0, since it is a diagonal matrix with 
positive diagonal elements. The gradient of / is given by: 

v/(x) = 2Ax = ( s _°r.y) ■ 



For any veS: 


(V/(x), V — x) = ^ S — O- 5 ^ - S — °-5) 

i=n-s -\-1 

05 ( Vi - s°A < s -0 ' 5 (llvlli - s 05 ) < 0, 

\i=n-s -\-1 / 


where the last inequality holds since ||v||i < \/||v||o||v|| 2 < x/i. Hence, x is 
co-stationary. The vector x satisfies x CE S- 2 (x) and since: 

/(x) = x 3 Ax = (s — 1) • (2s) -1 + 2s -1 = (s + 3) • (2s) -1 

> s • (2s) -1 = x 2 Ax = /(x), 

it follows that x is not a CW-maximum point. 


3.3 Support Optimality 

Theorem 3.1 establishes the relationship between the two stationarity condi¬ 
tions considered up to this point: co-stationarity and CW-maximality. A third 
condition, proposed in [4], that we will refer to as support optimality (SO), is 
given in the following definition. 

Definition 3.1 (Support Optimality) A vector x* e S' is called a support 
optimal (SO) point of (P) with respect to an index set T C {1,2, ...,n} if 
and only if it is an optimal solution of the optimization problem 

max{/(x) : ||x|| 2 < l,/i(x) C T}. 

x£M n 


(SO) 
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It is clear that, if x C S' is an optimal solution of problem (P), then it must 
be an SO point of (P) with respect to any index set T satisfying |T| < s and 
I\ ( x ) Q T. In that respect, support optimality is a necessary optimality con¬ 
dition for problem (P). It is a remarkably weak condition and cannot be used 
exclusively to derive a reasonable algorithm. Nevertheless, it is not totally 
futile. In order to enhance the performance, the CW-based algorithms that 
will be presented in Section 4 will produce a sequence of SO points, and in 
Section 5, we will adopt the variational re-normalization strategy suggested 
in [4], stating that for each sparse solution obtained by any technique, it is 
reasonable to replace this solution with the SO point that correspond to the 
same support. 

We will conclude this section with an example that demonstrates the po¬ 
tential benefit of employing algorithms that produce a point that satisfies 
stronger necessary optimality conditions. Consider the pit-prop data, which 
consists of 13 variables measuring various physical properties of 180 pit- 
props. This data set was suggested originally in [21], and since then was ex¬ 
tensively used as a benchmark example for sparse PCA; see, for example, [8, 
15,4], The problem has 13 variables and we consider a sparsity level of s = 4. 
Note that we can list all the ( 1 4 3 ) = 715 SO points that correspond to index 
sets with exactly 4 indices, and the optimal solution must be one of these 715 
points. Out of this set of points, 28 satisfy the co-stationarity condition and 
only 2 satisfy the CW-maximality condition. The following table presents the 
support sets of each of the co-stationarity points along with their function 
values. 


Table 1 The supports of the co-stationary points for the pit prop data. 


# 

Support 

CW-maximum 

Value 

# 

Support 

CW-maximum 

Value 

1 

{1,2,9,10} 

* 

2.937 

15 

{5,6,7,10} 


2.337 

2 

{1,2,7,10} 


2.883 

16 

{7,8,10,12} 


2.314 

3 

{1,2,7,9} 


2.859 

17 

{7,8,10,13} 


2.302 

4 

{1,2,8,9} 


2.797 

18 

{5,6,7,13} 


2.28 

5 

{1,2,8,10} 


2.759 

19 

{3,4,6,7} 


2.209 

6 

{1,2,6,7} 


2.697 

20 

{4,5,6,7} 


2.196 

7 

{2,7,9,10} 


2.696 

21 

{7,10,12,13} 


2.136 

8 

{2,6,7,10} 


2.592 

22 

{3,4,8,12} 


1.995 

9 

{1,6,7,10} 


2.587 

23 

{3,4,10,12} 


1.992 

10 

{1,2,3,4} 

* 

2.563 

24 

{3,10,11,12} 


1.609 

11 

{7,8,9,10} 


2.549 

25 

{3,5,12,13} 


1.516 

12 

{6,7,9,10} 


2.522 

26 

{1,5,12,13} 


1.414 

13 

{6,7,10,13} 


2.459 

27 

{2,5,12,13} 


1.408 

14 

{6,7,8,10} 


2.444 

28 

{3,5,11,13} 


1.382 


Since the number of CW-maximum points is significantly smaller than 
the number of co-stationary points, it is much more probable that the optimal 
solution will be found by an algorithm that produces CW-maximum points 
than an algorithm that produces co-stationary points. 
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4 Algorithms 

In this section, we will present two CW-based algorithms - GCW and PCW 
- that are guaranteed to converge after a finite amount of iterations to a a 
CW-maxima. Later on, in Section 5, we will demonstrate the superiority of 
these algorithm over methods which are based on the co-stationarity opti¬ 
mality condition such as the conditional gradient algorithm with unit step- 
size (ConGradU), that was suggested in [16], where it was also proven that 
limit points of the sequence generated by ConGradU are co-stationary point. 

In [18] several algorithms that produce a CW-minimum point were con¬ 
sidered. These block coordinate descent type algorithms perform at each iter¬ 
ation an optimization step with respect to one or two variables, while keeping 
the rest fixed. The coordinates that need to be altered are chosen to be the ones 
that produce the maximal decrease among all possible alternatives, or by ap¬ 
plying an index selection strategy based on a local first order information. We 
adopt this approach and present similar algorithms for the sparse PCA prob¬ 
lem. 

At each iteration of a CW-based algorithm applied to (P), at most two vari¬ 
ables will be updated. We can categorize each of the iterations according to 
whether the support is altered or not. Block coordinate algorithms suffer from 
a major drawback - a slow convergence rate. In order to reduce the effect of 
this displeasing characteristic, we will replace the point obtained at each step 
with an SO point that corresponds to the same support. This modification 
allows us to bypass the large amount of iterations that should have been de¬ 
voted for optimizing the variables with respect to a fixed support. 

Below we present the Greedy CW (GCW) algorithm. We denote by 0(T) 
an oracle that produce an SO point with respect to a given support T by solv¬ 
ing problem (SO). We will refer to this oracle as an SO oracle. In the specific 
case of the PCA problem, the SO oracle amounts to finding a normalized prin¬ 
cipal eigenvector of a submatrix of the covariance matrix. However, finding 
the maximum of a general convex function / over a unit ball is in principle 
a difficult task. We will assume that the solution produced by the oracle is 
uniquely defined by T. In addition, note that the oracle outputs an optimal 
solution of a problem consisting of maximizing a convex function over a com¬ 
pact and convex feasible set, and hence by [20, Corollary 32.3.2], there exists 
an optimal solution of the problem which is an extreme point. In particular, 
this means that we can assume without any loss of generality that the oracle 
outputs a vector with norm 1. This assumption will made from now on. 


The Greedy CW (GCW) Algorithm 

Input: / : K" —> K. - convex function; O(-) - SO oracle; s - sparsity level. 
Output: x - a CW-maximum point of (SPCA). 

Initialization: Take T £ {1,2,..., n} such that 1 < |T| < s and set 
x° = 0(T) and k = 0. 

General step: 
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1. While ||x fc ||o < s, compute 


j k G argmax {/(z) : z = 0(h (x fc ) U {j})} , 

1 Tr>('xk\ 


je/ 0 (x fc ) 


If /(G(/i(x fc ) U {j fc })) > /(x fe ), then set 

x H1 = 0(i 1 (x‘)U{i t }) 1 


k = k + 1, 


and return to 1; otherwise, go to 2. 

2. For every i G Ii(x fe ) and j G /o(x fe ) compute 



Let (ifc.jfc) = argmax{/ij : i G /i(x fe ), j G I 0 (x fc )}- If fi k ,j k > /(x fc ). 


then set 


.fe+i _ 




x 


k = k + 1, 

and return to 1 . 

Otherwise, STOP and set x t— x ,,:+ i . 


Step 1 of the GCW algorithm is in fact the greedy forward selection method 
proposed in [4]. Hence, in some sense, the GCW method is a generalization 
of this method, that does not terminate at the moment that a solution with a 
full support is obtained. However, from a more practical point of view, this 
resemblance is irrelevant due to the fact that, if the initial support satisfies 
\T\ = s, then the condition iixio < s will probably be false for all k in any 
reasonable practical scenario. 

The following theorem summarizes the key properties of the GCW algorithm. 

Theorem 4.1 Let {x fc } be the sequence generated by the GCW algorithm. Then , the 
folloiving statements hold. 

(i) The sequence of function values {/(x fc )} is monotonically increasing. 

(ii) The algorithm terminates after a finite amount of iterations. 

(iii) At termination , the algorithm produces a CW maximum point. 

Proof Part ( i ) follows immediately from the description of the GCW algo¬ 
rithm. Part (ii) is a consequence of the monotonicity of the algorithm (part 
(*)) and the fact that it only passes through SO points, from which there is 
only a finite number under the standing assumption that the solution pro¬ 
duced by the oracle 0(T) is uniquely defined by T. 
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To prove (Hi), consider the following partition of 62 (x): 

5 2 (x) = {z : ||z — x|| 0 <2,zeS} 

= S$(x) U5|(x) USf(x), 

where 

S 2 °(x) = {z £ S : ||z - x||o < 2, Ji(z) C /i(x)} 

S 2 (x) = {z £ 5 : jjz - xjjo < 2,/i(z) = /i(x) U {j}, j £ I 0 (x)} 

'S'l(x) = 

{ze S : 11 z x 11 0 < 2,/i(z) = (Ii(x) \ {*}) U {j},i £ Ii(x), j £ J 0 (x)}, 

and assume that the algorithm produced the point x. Since x is an SO point 
and S'!, (x) C {x : ||x|| 2 < l,/i(x) C T^x)}, it follows that /(x) > /(x) for any 
x £ S 2 (x). Now, note that the algorithm terminates only if after performing 
Step 2 we obtain that for any i £ /1 (x) and j £ /q (x) 


fij = max CTe{ _ ljl} {/(x - x&i + a\xi\ej)} 

= max Q {/ (x - x^ + aej) : a £ [—|»j|, |xj|]} 

< /(x), 

where the first equality is due to the fact that the maximum of a convex func¬ 
tion over a compact and convex set is attained at an extreme point, see [20, 
Corolalry 32.3.2]. Thus, /(x) > /(x) for any x £ 5|(x). This is enough for 
proving that x is CW-maximal in the case when ||x||o = s since in this case 
S.) (x) = 0. If ||x||o < s, then prior to entering Step 2, Step 1 must be per¬ 
formed. This step is terminated only if /(x) > /(x) for any 

x £ {z £ S : Ii(z) = 7i(x) U {j},j £ 7 0 (x)}, 

and since ^(x) C {z £ S : 7i(z) = /i(x) U {j},j £ /o(x)}, it implies 
that /(x) > /(x) for any x £ S'l(x), concluding that /(x) > /(x) for any 
x £ S 2 (x). □ 

Practically, if the initial support T satisfies \T\ = s, then most of the com¬ 
putation time in the GCW method is consumed in computing f lt j for each 
possible swap.This observation encourages us to consider the following vari¬ 
ation of GCW, which we name the Partial CW (PCW) algorithm. 


The Partial CW (PCW) Algorithm 

Input: / : K" — > 1 - convex function; O(-) - SO oracle; s - sparsity level. 
Output: x - a CW-maximum point of (SPCA). 

Initialization: Take T £ {1,2,..., n} such that 1 < \T\ < s and set 
x° = 0(T) and k = 0. 

General step: 
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1. While ||x fc ||o < s, compute 

jk S argmax {/(z) : z = 0(/i(x fc ) U {j})} , 

je/ 0 (x fc ) 

If/(©(/!(x fc ) U {j fc })) > /(x fe ), then set 

x‘ +1 = 0(I 1 (x‘)U{i t }) I 
k = k + 1, 

and return to 1; otherwise, go to 2. 

2. Set f? = Ji(x fe ). 

While \R\ > 0 

Set 4 G argmin {\x^\ : i € i?} and for each j G 7 0 (x fe ) compute 
fi k d = max {/ (x fc - ei fc + <r| | ej )} . 

Let j k e argmax { f ik : j G J 0 (x fc )}. 

If > /(x fc ), then set 

x fc+1 = O ((/i(x fc ) \ {4}) U {>}) , 
k = k + 1, 

and return to 1 . 

Otherwise, set f? = it. \ {4}- 
STOP and set x t— x fc+1 . 


Before termination, PCW will perform the computation of all possible f t j, 
thus assuring the convergence to a CW-maximum point, given that the out¬ 
put is of a full support. For the general step, the amount of computation will 
significantly decrease on the expense of finding the indices that provide the 
maximal increase in the function value. Nevertheless, the empirical study 
suggests that PCW provides similar results as GCW with respect to function 
values in a fraction of the time, as demonstrated in Section 5. 


5 Numerical Results 

We will illustrate the effectiveness of the algorithms proposed in the previous 
section on simulated and a gene expression datasets. We compared the results 
with the following alternative algorithms: the novel /{(-constrained version 
of ConGradU [16], the expectation maximization [11], approximate greedy 
[6] and thresholding [5]. The MATLAB implementation of ConGradU was 
kindly provided by the authors, for all the other alternative algorithms we 
used a MATLAB implementation available on the authors' web-pages. For 
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the thresholding algorithm and the algorithms proposed in this paper, we 
used a MATLAB implementation, which is available in the following URL: 

http://tx.technion.ac.il/-yakovv/packages/CW_PCA.zip 
Whenever an initialization is required, we set the initial point to be the so¬ 
lution of the thresholding method. Regarding the output, we adopt the vari¬ 
ational renormalization strategy suggested in [ ]. Hence, for each of the al¬ 
gorithms, we extracted the sparsity pattern (the set of indices of the nonzero 
elements). The actual output vector is determined to be equal to 0{T), where 
T is the generated sparsity pattern. The experiments were conduced on a PC 
with a 3.40GHz processor with 16GB RAM. 

5.1 Random Data 

The covariance matrix A is given by A = D T D, where D is the so-called 
"data matrix". Each entry in the data matrix D £ R mxr ' was randomly gener¬ 
ated according to the Gaussian distribution with zero mean and variance 1 /m 
(Di j ~ A/”(0,1/m)). We considered data matrices with n = 2000, 5000,10,000 
and 50, 000 variables. The number of observations is set to m = 150 for all 
matrices. The sparsity levels considered are s = 5,10,..., 250, and for each 
sparsity level we generated 100 realizations. We will measure the effective¬ 
ness of the algorithms according to the average proportion of variability ex¬ 
plained by the algorithm with respect to the largest eigenvalue of the data 
covariance matrix (i.e., x t Ax/Ai(A), where x is the solution and Ai(A) is 
the largest eigenvalue of A). 

5.1.1 GCWvs. PCW 

First, we would like to compare the effectiveness and performance of the C W- 
based algorithms proposed in the previous section: GCW and PCW. We con¬ 
ducted the comparison based on data matrices with 2,000 variables and the 
results are given in Figure 1. 


Proportion of Explained Variance 
as a Function of Sparsity Level (n=2k) 



Running Time as a Function of Sparsity Level (n=2k) 



Fig. 1 GCW vs. PCW - The proportion of explained variability is given in the left figure and the com¬ 
putation time is given in the right one. The plot in both figures are given as a function of the sparsity 
level. 
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We can clearly see that both methods achieve similar results with respect 
to the function values, while PCW achieves these results in a fraction of the 
time. Thus, in the remaining numerical study we will omit GCW. Although 
the partial version remarkably reduces the computation time, it is still not 
competitive for very large-scale problems when a full path of solutions is 
required. Thus, for such cases, we will also examine the effect of initializing 
PCW with the solution of the previous run (with the smaller sparsity level), 
and we will refer to such a continuation scheme as PCW cont . 

5.1.2 PCW vs. Alternative Methods 

We will now compare the effectiveness and performance of PCW with respect 
to the alternative algorithms mentioned earlier. The setting for this set of ex¬ 
periments is the same as the one described in the previous example, but with 
problems with n = 5, 000, 10,000 and 50,000 variables. Figure 2 provides the 
proportion of explained variability as a function of the sparsity level. 



-e- pcw coi 
-I-PCW 
-t-CGU 
-b-EM 
-*-aGr 
— Trsh 


Proportion of Explained Variance 
as a Function of Sparsity Level (n=5k) 


Sparsity Level 


Proportion of Explained Variance 
as a Function of Sparsity Level (n=10k) 



Proportion of Explained Variance 
as a Function of Sparsity Level (n=50k) 



Fig. 2 PCW vs. Others - The proportion of explained variability as a function of the sparsity level for 
n = 5000,10, 000 and 50, 000 are given in the upper left, upper right and bottom figures, respectively. 


For small sparsity levels (< 50) most of the algorithms provide similar re¬ 
sults, but as the sparsity level is increased, the CW algorithms becomes supe¬ 
rior to all the other methods. This advantage is not achieved without a price. 
In Figure 3 we provide the cumulative computation time of the algorithms 
(the cumulative time is considered since the approximate greedy algorithm 
provides a full set of solutions). 












22 


Amir Beck, Yakov Vaisbourd 


Cumulative Running Time 
as a Function of Sparsity Level (n=5k) 



Cumulative Running Time 
as a Function of Sparsity Level (n=10k) 



Cumulative Running Time 
as a Function of Sparsity Level (n=50k) 



Fig. 3 PCW vs. Alternative Methods - The cumulative computation times as a function of the sparsity 
level for n = 5, 000,10, 000 and 50, 000 are given in the upper left, upper right and bottom, respectively. 
The SVD time is the time required for computing the principal eigenvector of the covariance matrix that 
corresponds to the generated data, which is used in order to find the thresholding solution, and in order to 
initialize the CW and ConGradU algorithms. 


Even though PCW has greatly decreased the computation time with re¬ 
spect to GCW, it still requires a notably higher amount of computation time 
with respect to the alternative algorithms. The scheme we referred as PCW,, mt 
achieves similar results to PCW with respect to the function value. Regard¬ 
ing the running time, this scheme is competitive to the EM algorithm and 
requires somewhat more computational effort than the ConGradU and ap¬ 
proximate greedy algorithms, thus providing a reasonable approach when a 
full set of solutions is required. 


5.2 Gene Expression Dataset 

Sparse PCA is extensively utilized in the identification of the genes that re¬ 
flect the changes in the gene expression patterns during different biological 
states, thus contributing to the diagnosis and research of certain diseases such 
as cancer. Figure 4 illustrates the proportion of explained variability and the 
cumulative running time for a Leukemia data set [22]. This data set is com¬ 
posed from gene expression profiles of 72 patients with 12582 genes. The data 
set is normalized such that it has zero mean and unit variance. 
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Proportion of Explained Variance 
as a Function of Sparsity Level (Leukemia) 


Cumulative Running Time 
as a Function of Sparsity Level (Leukemia) 




Fig. 4 Leukemia Gene Expression Data - The proportion of explained variability is given in the left figure 
and the cumulative computation time is given in the right one. The plot in both figures are given as a 
function of the sparsity level. 


Most of the algorithms under consideration provide similar results with 
respect to the explained variability, which might indicate that this problem 
is, in a sense, rather easy to solve. We conducted similar experiments for ad¬ 
ditional 20 gene expression data sets from the GeneChip oncology database 
[23] that is publicly available in: 

http://compbio.dfci.harvard.edu/compbio/tools/gcod 

while commonly, all the algorithms provided similar results, we can still see 
in Figure 5 that PCW yields the best solution (with respect to the function 
value) more times than the alternative algorithms, and consequently it ob¬ 
tains the smallest mean error with respect to the best solution. 




Fig. 5 Gene Expression Data - The left figure illustrates for each sparsity level the proportion of the 
number of data sets for which each algorithm obtained the best solution. The right figure illustrates for each 
sparsity level the mean error with respect to the best solution (the approximate greedy and thresholding 
algorithms were disregarded since both of them provide relative poor results). 
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6 Conclusions 

In this paper, we considered the problem of maximizing a continuously dif¬ 
ferentiable convex function over the intersection of an I 2 unit ball and a spar¬ 
sity constraint. We have shown that coordinate-wise maximality is a more 
restrictive condition than co-stationarity, which is the basis of many well- 
known methods for solving the sparse PCA problem. We introduced two al¬ 
gorithms (GCW and PCW) that are guaranteed to produce a CW-maximal 
solution, and demonstrated empirically the potential benefit of using this al¬ 
gorithms over some common algorithms proposed for this problem. 

Acknowledgements We would like to thank two anonymous referees for their helpful remarks 
that helped to improve the presentation of the paper. 
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